Applicability of MAHOUT for Large Data Sets

Author

  • Michael Sevilla
Abstract

In an attempt to unify two hot trends in the research community, this project explores distributed data mining. Processing big data across a large number of nodes to extract information has applications in many sectors outside of computer science. This project explores Apache Mahout and attempts to quantify its modifiability, accuracy, and performance. To assess modifiability, the source code is examined. To gauge accuracy, the resulting output is scored and compared against the highest-scoring algorithm in an online competition for the same dataset. Finally, performance is examined to see how well Mahout scales. This project shows the importance of understanding the fundamentals of any data mining solution before attempting to use it. Poorly chosen parameters and a general lack of understanding led to interesting results and, more importantly, a number of useful "lessons learned".

Similar resources

Data Deduplication in Parallel Mining of Frequent Item sets using MapReduce

FiDoop is a parallel frequent itemset mining algorithm built on the MapReduce programming model. FiDoop incorporates the frequent items ultrametric tree (FIU-tree), and three MapReduce jobs are applied to complete the mining task. The scalability problem has been addressed by the implementation of a handful of FP-growth-like parallel FIM algorithms. In FiDoop, the mappers independently and concurren...


Feedback - Study and Improvement of the Random Forest of the Mahout library in the context of marketing data of Orange

In the realm of Big Data systems, Hadoop has emerged as one of the most popular systems and a very diverse ecosystem has grown around it, meeting all kinds of functional and technical needs. One niche that should have been a place of choice in this ecosystem is data analytics: first because getting value out of large datasets requires efficient Machine Learning (ML) algorithms, second because l...


Recommendation System Using Collaborative Filtering Algorithm Using Mahout

A recommendation system helps people make decisions about an item or person. The growth of the World Wide Web and e-commerce has been the main channel for recommendation systems. Because of the large size of the data, recommendation systems suffer from a scalability problem. One solution to this problem is Hadoop. Collaborative filtering is a machine learning algorithm and Mahout is an open source Java library wh...


Application of Benford’s Law in Analyzing Geotechnical Data

Benford’s law predicts the frequency of the first digit of numbers met in a wide range of naturally occurring phenomena. In data sets that follow Benford’s law, numbers start with a small leading digit more often than with a large leading digit. This law can be used as a tool for detecting fraud and abnormality in number sets, including any fabricated number sets. This can be used as an ...


Analysis and Evaluation of Similarity Metrics in Collaborative Filtering Recommender System

KEMI-TORNIO UNIVERSITY OF APPLIED SCIENCES. Degree programme: Business Information Technology. Writer: Guo, Shuhang. Thesis title: Analysis and evaluation of similarity metrics in collaborative filtering recommender system. Pages (of which appendix): 62 (1). Date: May 15, 2014. Thesis instructor: Ryabov, Vladimir. This research is focused on the field of recommender systems. The general aims of this t...


Publication date: 2012